INTRODUCTION

For many students, choosing a college major takes a great deal of thought. Some students know exactly what their passion is from a young age, while others start college completely undecided. Many also change their minds about their major while in college, sometimes more than once. Many factors come into play when choosing a course of study to pursue for the next four or more years. This analysis examines those factors in depth in an attempt to provide clear, accurate information to aid future students in their search for a college major.

The first part of the analysis focuses on building a model that accurately predicts the median salary of each major from a selection of variables. Because salary can play a large role in a person’s quality of life, many students may choose a major that is known to lead to higher salaries after graduation. Other factors that students now consider when weighing prospective salaries and majors include the share of women in the field and the unemployment rate. A reasonably precise model is important so that students can make informed decisions based on reliable data.

The second part of the analysis looks at how the numerical data relate to groups defined by categorical variables. Again using a selection of variables, we predicted the category of each major, and we assigned each major a binary value indicating whether or not it is part of the STEM field. This gives a more holistic view of each major, since more than just salary is being analyzed. With this knowledge, preconceptions about each major can be confirmed or refuted, again giving students reliable information on which to base their decisions.

DATA

The College Majors dataset used in this analysis comes from the Census Bureau’s American Community Survey Public Use Microdata Series for 2010-2012. The data set was analyzed by FiveThirtyEight, an analytics-driven internet news source. The website, whose name comes from the number of electors in the Electoral College, covers politics, sports, science and health, economics, and culture, with the bulk of its articles focusing on politics and sports. All of its articles include an analysis of collected data. For politics, this is primarily polls, each of which is evaluated for accuracy and bias before its results are used. For sports, much of the data is built up from the results of prior games, and for each major sports league the site maintains a ratings system that is updated after every game. For the data set used in this project, FiveThirtyEight compiled data for students in 173 different majors from schools across the country and looked at descriptive statistics for each major. Their findings were published in the article “The Economic Guide To Picking A College Major”, which gives an overview of college majors and the career prospects they offer after graduation.

Each observation in the data corresponds to one of the 173 unique majors and summarizes a sample of US citizens who have completed an undergraduate or graduate program. The dataset includes several variables on college majors, employment status, and salary after graduation. There are two categorical variables, Major and Major Category, and 17 continuous variables. The variables studied most extensively in this analysis are Major, Unemployment Rate, Median Salary, and Share Women. A new variable, IQR, was also added as a potential predictor for modeling. All of these variables are associated with a major, and each major is represented by a varying number of individuals.

The first half of the analysis adds a binary variable called STEM, which indicates whether or not the major is considered part of the STEM disciplines. STEM stands for Science, Technology, Engineering, and Mathematics, which encompasses a multitude of fields. Majors such as biology and chemistry fall under the natural sciences, and mathematics and statistics fall under the formal sciences; all of these are part of STEM. One exception is the social sciences, which are categorized with the humanities instead. Other majors are more difficult to classify. For example, agricultural science was identified as a STEM major, even though it does not fit the typical picture of a science.
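One way such a flag could be derived is from the Major Category column. The sketch below is only illustrative: the data frame `majors`, its rows, and the list `stem_categories` are assumptions, since the exact set of categories treated as STEM in the analysis is not listed.

```r
# Toy data; the majors and categories here are illustrative only.
majors <- data.frame(
  Major = c("BIOLOGY", "STATISTICS", "SOCIOLOGY", "PLANT SCIENCE", "HISTORY"),
  Major_category = c("Biology & Life Science", "Computers & Mathematics",
                     "Social Science", "Agriculture & Natural Resources",
                     "Humanities & Liberal Arts"),
  stringsAsFactors = FALSE
)

# Assumed set of categories treated as STEM; adjust to match the analysis.
stem_categories <- c("Biology & Life Science", "Computers & Mathematics",
                     "Engineering", "Physical Sciences",
                     "Agriculture & Natural Resources")

# 1 if the major's category is in the STEM list, 0 otherwise.
majors$STEM <- as.integer(majors$Major_category %in% stem_categories)
print(majors[, c("Major", "STEM")])
```

Keeping the category list in one vector makes borderline decisions, such as agricultural science, easy to revisit.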

## # A tibble: 6 x 10
##   Major ShareWomen Unemployment_ra~ Median Total Part_time Non_college_jobs
##   <chr>      <dbl>            <dbl>  <dbl> <dbl>     <dbl>            <dbl>
## 1 MINI~      0.102           0.117   75000   756       170              257
## 2 META~      0.153           0.0241  73000   856       133              176
## 3 NAVA~      0.107           0.0501  70000  1258       150              102
## 4 CHEM~      0.342           0.0611  65000 32260      5180             4440
## 5 NUCL~      0.145           0.177   65000  2573       264              657
## 6 ACTU~      0.441           0.0957  62000  3777       296              314
## # ... with 3 more variables: P25th <dbl>, P75th <dbl>, STEM <dbl>

Above is a histogram depicting the count of majors that fall within each salary range, in $5,000 increments. Notice that the outlier, Petroleum Engineering, sits well to the right of the rest of the histogram.
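A histogram with $5,000 bins can be built along the following lines. This is a minimal sketch with made-up salaries in `median_salary`, not the code or data behind the figure.

```r
# Illustrative median salaries (dollars); not the real data.
median_salary <- c(22000, 31000, 33500, 36000, 38000, 40000, 42500,
                   47000, 52000, 60000, 65000, 110000)

# Bin edges every $5,000, wide enough to cover the full range.
bin_width <- 5000
breaks <- seq(floor(min(median_salary) / bin_width) * bin_width,
              ceiling(max(median_salary) / bin_width) * bin_width,
              by = bin_width)

# plot = FALSE returns the counts without drawing; drop it to see the chart.
h <- hist(median_salary, breaks = breaks, plot = FALSE)
print(data.frame(lower = head(h$breaks, -1), count = h$counts))
```

In the toy data the single value of 110000 lands in the last bin, far from the others, which is the pattern described for Petroleum Engineering.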

RESULTS

To answer the first question, we created two variables, STEM and IQR. IQR was calculated by subtracting each major’s P25th from its P75th, the first and third quartiles of the major’s starting salary. Because Food Science did not have data on Median Salary, which is what we were trying to predict, we left it out of our models. We also left out Petroleum Engineering because it was such an extreme outlier, as evidenced by the histogram in the Data section. The R-Squared values of all our models increased by around 3% after removing Petroleum Engineering, and including it caused our models to overestimate nearly every other major’s median salary.
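The two preparation steps described here, computing IQR and dropping the excluded majors, might look like the sketch below. The data frame `grads` and its salary figures are invented for illustration; only the column names P25th, P75th, and Median come from the dataset.

```r
# Toy rows; the salary figures are illustrative only.
grads <- data.frame(
  Major  = c("PETROLEUM ENGINEERING", "FOOD SCIENCE", "NURSING", "HISTORY"),
  Median = c(110000, NA, 48000, 34000),
  P25th  = c(95000, 32000, 39000, 25000),
  P75th  = c(125000, 70000, 58000, 40000),
  stringsAsFactors = FALSE
)

# Spread of the middle half of salaries within each major.
grads$IQR <- grads$P75th - grads$P25th

# Drop rows with no response value and the extreme outlier.
model_data <- subset(grads, !is.na(Median) & Major != "PETROLEUM ENGINEERING")
print(model_data)
```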

When building a predictive model for Median Salary, we built 8 different models before deciding on the best one. We started with the independent variables Total (the number of people in the major), ShareWomen, Part_time, and Non_college_jobs, chosen by stepwise regression. We then added an interaction between Total and ShareWomen; all other interactions were insignificant. Next, we added the IQR we calculated earlier, and finally the binary variable STEM. For each of these combinations of predictors, we also fit the same model to the log of Median Salary to see whether it predicted that better. The results are in the table below. ManualLogModelwSTEM had the highest R-Squared and the lowest RMSE (even taking into account that its RMSE is in terms of log(Median) rather than Median), and its MAE was only negligibly higher than that of ManualLogModel, so we chose ManualLogModelwSTEM as our best model for predicting a major’s median salary.

## 
## Call:
## lm(formula = log(Median) ~ (Total) + ShareWomen + (Part_time) + 
##     (Non_college_jobs) + IQR + STEM + I((Total) * ShareWomen), 
##     data = recentgradsnumericaldata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34275 -0.10591 -0.01206  0.08519  0.47405 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.065e+01  5.479e-02 194.449  < 2e-16 ***
## Total                    2.740e-06  9.098e-07   3.012  0.00301 ** 
## ShareWomen              -5.148e-01  6.667e-02  -7.721 1.10e-12 ***
## Part_time               -1.070e-05  3.231e-06  -3.313  0.00114 ** 
## Non_college_jobs        -4.142e-06  1.599e-06  -2.591  0.01043 *  
## IQR                      6.537e-06  1.306e-06   5.005 1.44e-06 ***
## STEM                     5.985e-02  2.792e-02   2.144  0.03355 *  
## I((Total) * ShareWomen)  1.970e-06  1.241e-06   1.587  0.11444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1537 on 163 degrees of freedom
## Multiple R-squared:  0.5983, Adjusted R-squared:  0.5811 
## F-statistic: 34.69 on 7 and 163 DF,  p-value: < 2.2e-16

##                 Model RSquared  pValue          MAE         RMSE
## 1       OriginalModel   0.4742 2.2e-16 5540.3000351 7345.1259668
## 2    OriginalLogModel   0.4975 2.2e-16    0.1301913    0.1678341
## 3    InteractionModel   0.4803 2.2e-16 5475.4039513 7301.9643314
## 4 InteractionLogModel   0.5046 2.2e-16    0.1287521    0.1666438
## 5         ManualModel   0.5706 2.2e-16 4993.8911713 6637.7848105
## 6      ManualLogModel   0.5870 2.2e-16    0.1187236    0.1521589
## 7    ManualModelwSTEM   0.5783 2.2e-16 5016.1067591 6578.0466705
## 8 ManualLogModelwSTEM   0.5983 2.2e-16    0.1190901    0.1500586
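Because the chosen model has log(Median) as its response, its coefficients are easier to read as percentage changes. The short calculation below takes the ShareWomen and STEM estimates from the summary above and converts them. It ignores the small, non-significant interaction term, so the ShareWomen figure is only a rough reading.

```r
# Coefficients copied from the ManualLogModelwSTEM summary.
coefs <- c(ShareWomen = -0.5148, STEM = 0.05985)

# On a log response, exp(b) - 1 is the proportional change in the median
# salary for a one-unit increase in the predictor, holding the others fixed.
pct_change <- 100 * (exp(coefs) - 1)
print(round(pct_change, 1))
```

Read this way, a STEM major is associated with a median salary roughly 6% higher, and moving from a share of women of 0 to 1 is associated with a median salary roughly 40% lower.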

Below, we have both an interactive graph of Actual and Predicted Median Salary and a portion of the table of each major’s Actual and Predicted Median Salary. Although our model predicts the log of the median salary, we exponentiated its predictions to obtain salaries in dollars so that the table and graph are easier to understand.
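Converting log-scale predictions back to dollars can be sketched as follows. The data here are synthetic and the two-predictor model is a simplified stand-in for the full model; the point is only the `exp()` step and the dollar-scale error calculation.

```r
set.seed(1)
# Synthetic stand-in for the majors data; values are not from the survey.
n <- 60
toy <- data.frame(ShareWomen = runif(n), STEM = rbinom(n, 1, 0.4))
toy$Median <- exp(10.6 - 0.5 * toy$ShareWomen + 0.06 * toy$STEM +
                  rnorm(n, sd = 0.15))

log_fit <- lm(log(Median) ~ ShareWomen + STEM, data = toy)

# predict() returns log-dollars; exp() maps them back to dollars.
toy$Predicted <- exp(predict(log_fit, newdata = toy))

# Errors on the dollar scale, comparable with the non-log models.
mae  <- mean(abs(toy$Median - toy$Predicted))
rmse <- sqrt(mean((toy$Median - toy$Predicted)^2))
print(c(MAE = mae, RMSE = rmse))
```

Computing MAE and RMSE after back-transforming puts a log model on the same footing as the models fit to Median directly.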

For the second question, the prediction of major groups, we used the same dataset as in the first question. The aim was to see how the majors group together and to find the key features that predict those groups well. STEM jobs are currently in high demand and have a reputation for the highest salaries and lowest unemployment rates. However, they also have a reputation for being a male-dominated field. We wanted to see which of these assumptions about STEM jobs were true. A classification approach was ideal here, as we could use models such as logistic regression and K-nearest neighbors to decide which predictors were useful in predicting whether a major was STEM or non-STEM.

Before looking at STEM vs. non-STEM, we wanted to gauge how the majors clustered based on all the predictors. The cluster plot does not show many distinct clusters, but we can still extract some useful information from it. In the table of cluster means below, clusters 12 and 14 have the highest median salaries and the lowest shares of women, and cluster 12 also has the lowest unemployment rate; these clusters likely contain STEM majors. Although this gives us some information, we can get a better picture by using the classification methods mentioned above.

## [1] "Diagram of the Different Clusters and the Values"
##    ShareWomen Unemployment_rate    Median Full_time_rate Part_time_rate
## 1   0.4579469        0.06451911  44881.82      0.6622497      0.2017554
## 2   0.7466545        0.08261862  25057.14      0.5988601      0.2617587
## 3   0.5911850        0.06674466  34714.29      0.6705945      0.2380143
## 4   0.5037521        0.06682306  46666.67      0.7467099      0.1746597
## 5   0.6488409        0.07578452  31000.00      0.5928351      0.3139806
## 6   0.2951070        0.07162313  50615.38      0.7013033      0.1912244
## 7   0.7672858        0.06556833  29218.75      0.6233107      0.2833793
## 8   0.4413965        0.07539435  38033.33      0.6689975      0.2022889
## 9   0.6122797        0.06958412  32100.00      0.6449989      0.2642523
## 10  0.4507461        0.06455098  40321.74      0.6886666      0.1959549
## 11  0.2812846        0.04496628  56016.67      0.7123062      0.1602524
## 12  0.1205643        0.01838053 110000.00      0.7905088      0.1154339
## 13  0.5959225        0.07323429  36120.00      0.6594971      0.2226075
## 14  0.1207341        0.06382103  72666.67      0.7456935      0.1664928
## 15  0.6617689        0.07129694  33116.67      0.6063580      0.2553743
## 16  0.2867549        0.07173579  61400.00      0.7064755      0.1639565
##    Full_time_year_round_rate College_jobs_rate Non_college_jobs_rate
## 1                  0.5101308         0.3417019             0.2717890
## 2                  0.4396190         0.3425649             0.3517315
## 3                  0.4952972         0.3472157             0.3547422
## 4                  0.5908781         0.2571738             0.3737175
## 5                  0.4364631         0.3042418             0.4136695
## 6                  0.5331848         0.3936544             0.2564933
## 7                  0.4652809         0.3403315             0.3946707
## 8                  0.5128422         0.2590758             0.3745803
## 9                  0.4676389         0.3802660             0.3564395
## 10                 0.5375857         0.3046191             0.3419026
## 11                 0.5582938         0.4642425             0.2416619
## 12                 0.5160325         0.6558358             0.1556221
## 13                 0.4890615         0.3147807             0.3461480
## 14                 0.4868344         0.4720607             0.2088785
## 15                 0.4459732         0.3244849             0.3512596
## 16                 0.5318566         0.5321552             0.1623469
##    Low_wage_jobs_rate
## 1          0.06139623
## 2          0.14521260
## 3          0.09610236
## 4          0.07027439
## 5          0.13409316
## 6          0.05945863
## 7          0.12774054
## 8          0.08230249
## 9          0.11847901
## 10         0.07937323
## 11         0.06324116
## 12         0.08251389
## 13         0.08073916
## 14         0.02204586
## 15         0.10942453
## 16         0.05108260
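A table of cluster means like the one above can be produced with k-means along the following lines. This is a sketch on synthetic data with three of the columns. Standardising the variables first is an assumption, since the report does not say whether it was done, and 4 clusters are used instead of 16 to keep the output short.

```r
set.seed(42)
# Synthetic stand-in for three of the rate/salary columns.
toy <- data.frame(
  ShareWomen        = runif(90),
  Unemployment_rate = runif(90, 0.02, 0.12),
  Median            = round(runif(90, 25000, 75000))
)

# Standardise so Median (in dollars) does not dominate the distances.
scaled <- scale(toy)

# The report's table has 16 clusters; 4 keeps the toy output short.
km <- kmeans(scaled, centers = 4, nstart = 20)

# Cluster means on the original scale, like the table above.
cluster_means <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
print(cluster_means)
```

Cluster labels from `kmeans` are arbitrary and can change between runs, so any cluster number quoted in the text should be checked against the table from the same run.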

Logistic regression, a common classification algorithm, was well suited to predicting whether a major would be classified as STEM. Since our data set has many quantitative predictors, such as ShareWomen, Unemployment Rate, Median Salary, and Full Time Rate, we needed to find the best few to use. Using the bestglm function, we found the models with the lowest criterion scores, shown below. The best model consisted of the share of women, median salary, the rate of people in full-time jobs, the rate of jobs that require a college degree, and the rate of low-wage jobs. We also wanted to see how other models performed, so we compared four models: the model with all predictors in the recent-grads data, the best model, and the second- and third-best models. The table below marks the predictors included in each of the top bestglm models with “TRUE”.
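The idea behind this kind of search, fitting a logistic regression for every subset of predictors and ranking the fits by an information criterion, can be illustrated in base R. The sketch below is a hand-rolled stand-in and not the bestglm implementation. The data and the predictors `x1`, `x2`, `x3` are synthetic, and the use of BIC as the criterion is an assumption about the report's run.

```r
set.seed(7)
# Synthetic data: STEM depends on x1 and x2 only; x3 is noise.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$STEM <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.5 * toy$x2))

predictors <- c("x1", "x2", "x3")

# Every non-empty subset of the predictors.
subsets <- unlist(lapply(seq_along(predictors), function(k)
  combn(predictors, k, simplify = FALSE)), recursive = FALSE)

# Fit a logistic regression per subset and record its BIC.
scores <- sapply(subsets, function(vars) {
  fit <- glm(reformulate(vars, response = "STEM"),
             family = binomial(link = "logit"), data = toy)
  BIC(fit)
})

# Lowest criterion first, as in the bestglm ranking.
ranking <- data.frame(model = sapply(subsets, paste, collapse = " + "),
                      BIC = scores)[order(scores), ]
print(ranking)
```

An exhaustive search like this grows as 2^p in the number of predictors, which is manageable for the nine rate and salary columns considered here.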

To compare these models further, we used a train/test split, with 80% of the data for training and 20% for testing. After applying each model to the test set, we recorded its accuracy and error values. The model with all the variables had the highest accuracy and the lowest errors, which differs from the ranking that the bestglm function gave us. The best bestglm model had the lowest accuracy (tied with the third-best model). One possible explanation is that the other models predicted STEM more often and the test set contained a higher share of STEM majors, which would shift the sensitivity and specificity values. Overfitting is another concern with the full model, since it might work best on this data but not on data from other sources. Even so, accuracy was high for all four models, between about 83% and 89%, and the predictors with the lowest p-values in the full logistic model were Share of Women, Full Time Rate, and Median Salary.

##   ShareWomen Unemployment_rate Median Full_time_rate Part_time_rate
## 1       TRUE             FALSE   TRUE           TRUE          FALSE
## 2       TRUE             FALSE  FALSE           TRUE          FALSE
## 3       TRUE             FALSE   TRUE           TRUE          FALSE
## 4       TRUE             FALSE   TRUE           TRUE           TRUE
## 5       TRUE              TRUE   TRUE           TRUE          FALSE
##   Full_time_year_round_rate College_jobs_rate Non_college_jobs_rate
## 1                     FALSE              TRUE                 FALSE
## 2                     FALSE              TRUE                 FALSE
## 3                     FALSE              TRUE                  TRUE
## 4                     FALSE              TRUE                  TRUE
## 5                     FALSE              TRUE                 FALSE
##   Low_wage_jobs_rate Criterion
## 1               TRUE  183.0635
## 2               TRUE  183.7300
## 3               TRUE  183.8820
## 4               TRUE  184.0204
## 5               TRUE  184.4662
## [1] "This is the summary of the best Model"
## 
## Call:
## glm(formula = STEM ~ ShareWomen + Median + Full_time_rate + College_jobs_rate + 
##     Low_wage_jobs_rate, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.78516  -0.79836   0.06789   0.79340   2.37312  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.919e+00  3.077e+00   2.573  0.01008 *  
## ShareWomen         -3.529e+00  1.264e+00  -2.793  0.00523 ** 
## Median              6.844e-05  3.511e-05   1.950  0.05122 .  
## Full_time_rate     -1.242e+01  3.126e+00  -3.972 7.12e-05 ***
## College_jobs_rate   3.365e+00  1.509e+00   2.230  0.02574 *  
## Low_wage_jobs_rate -1.749e+01  6.563e+00  -2.664  0.00772 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 189.92  on 136  degrees of freedom
## Residual deviance: 131.92  on 131  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 143.92
## 
## Number of Fisher Scoring iterations: 5
## [1] "Full Model"
##    Accuracy        R2      RMSE       MAE
## 1 0.8857143 0.5924948 0.3380617 0.1142857
## [1] "Best bestglm Model"
##    Accuracy        R2      RMSE       MAE
## 1 0.8285714 0.4386395 0.4140393 0.1714286
## [1] "Second Best Model"
##    Accuracy        R2      RMSE       MAE
## 1 0.8571429 0.5108789 0.3779645 0.1428571
## [1] "Third best Model"
##    Accuracy        R2      RMSE       MAE
## 1 0.8285714 0.4608004 0.4140393 0.1714286
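The 80:20 evaluation can be sketched as below, using synthetic data and a two-predictor model as a stand-in for the real fits. The 0.5 threshold is an assumption. The figures above are consistent with errors computed on hard 0/1 predictions: for the full model, MAE 0.1143 equals 1 minus the accuracy of 0.8857, and RMSE 0.3381 is the square root of that MAE.

```r
set.seed(123)
# Synthetic majors: lower ShareWomen and higher Median make STEM more likely.
n <- 170
toy <- data.frame(ShareWomen = runif(n),
                  Median = runif(n, 25000, 75000))
toy$STEM <- rbinom(n, 1, plogis(1 - 4 * toy$ShareWomen +
                                (toy$Median - 40000) / 10000))

# 80:20 split into training and test rows.
train_idx <- sample(n, size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- glm(STEM ~ ShareWomen + Median,
           family = binomial(link = "logit"), data = train)

# Predicted probabilities on held-out rows, thresholded at 0.5.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

accuracy <- mean(pred == test$STEM)
# With 0/1 labels, MAE is the misclassification rate and RMSE its square root.
mae  <- mean(abs(pred - test$STEM))
rmse <- sqrt(mean((pred - test$STEM)^2))
print(c(Accuracy = accuracy, MAE = mae, RMSE = rmse))
```

With a test set of only a few dozen majors, one misclassified major moves accuracy by about three percentage points, so small differences between models on a single split should be read cautiously.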

After finding the most reliable predictors to be share of women, full-time rate, and median salary, we wanted to look at how well nearby data points predict whether a major is STEM. We used K-nearest neighbors with the predictors from the best bestglm model found earlier and looked for the optimal number of neighbors. Looking at the accuracy for each value of k, we found the optimal k to be 3, with a 76.1% accuracy rate. The graph below shows the accuracy for each k, with the peak at k = 3. The accuracy starts high, rises until k = 3, and then decreases until we reach our maximum k of 171, the number of data points we had. This shows that majors with values close to each other are the best predictors of one another, while including more distant points makes the accuracy of the prediction drop dramatically.
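Scanning over k and recording test accuracy can be illustrated with a small hand-rolled K-nearest-neighbors classifier. The sketch below uses synthetic two-feature data. The function `knn_predict` is written for illustration and is not the routine used in the analysis. With real predictors on very different scales, such as rates and dollar salaries, the features would normally be standardised first.

```r
set.seed(2024)
# Synthetic two-feature data with two overlapping classes.
n <- 120
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 1.5), ncol = 2))
y <- rep(c(0L, 1L), each = n / 2)

# Hold out 30 points for testing.
test_idx <- sample(nrow(x), 30)
x_train <- x[-test_idx, ]; y_train <- y[-test_idx]
x_test  <- x[test_idx, ];  y_test  <- y[test_idx]

# Majority vote among the k closest training points (Euclidean distance).
knn_predict <- function(x_train, y_train, x_new, k) {
  apply(x_new, 1, function(p) {
    d <- sqrt(colSums((t(x_train) - p)^2))
    nearest <- order(d)[seq_len(k)]
    as.integer(mean(y_train[nearest]) > 0.5)
  })
}

# Odd k values avoid tied votes with two classes.
ks <- seq(1, 15, by = 2)
acc <- sapply(ks, function(k)
  mean(knn_predict(x_train, y_train, x_test, k) == y_test))
best_k <- ks[which.max(acc)]
print(data.frame(k = ks, accuracy = acc))
```

When k equals the number of training points, every prediction is the overall majority class, which is why accuracy flattens out at the largest values of k.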

CONCLUSION

In our first question, we attempted to build a model that would best predict the median salary for students in a given major. Our best model used the total number of students in the major, the share of those students who are women, the number of people working part-time jobs, the number of people working non-college jobs, the interquartile range of the major’s salary, whether or not the major is STEM, and the interaction between Total and ShareWomen to predict the log of each major’s median salary. All of these terms were significant at the alpha = 0.05 significance level except the interaction term, and the model had an R-Squared of 59.83%, meaning that it accounted for nearly 60% of the variation in the log of median salary.

For the second question, we were mainly trying to group our data into STEM and non-STEM. The cluster chart gave us useful information about which majors are similar to each other and about some of the trends in the data. Majors with higher median salaries tend to be clustered with majors that have lower unemployment rates and lower shares of women. We then found that logistic regression classified STEM majors well using a combination of the share of women, median salary, the rate of people in full-time jobs, the rate of jobs that require a college degree, and the rate of low-wage jobs. Although this model had a lower accuracy than the others on the test set, that was likely because the other models selected STEM more frequently and were right more often because a higher percentage of the samples were STEM. The model still had about 83% accuracy, so it could predict reasonably well whether a major was STEM or not. Lastly, we tested K-nearest neighbors to find out how many neighboring points are needed to determine whether a major is likely to have STEM-like unemployment and salary properties. We determined that 3 neighbors provided the best accuracy, at around 76%.

As we mentioned in our introduction, the median starting salary for a major is important because it influences many people’s decisions about what they want to do in their careers. If we had more data on each major, such as which schools people attended and their average GPAs, our model would likely perform even better in predicting a major’s median starting salary. Moreover, it is not surprising that many students pursue STEM career paths, as the clustering charts and the logistic model associated STEM with high median salaries and low unemployment rates, and full-time rate was also a good predictor of whether a major is STEM. The share of women was also insightful here, because many people encourage high school girls to enroll in STEM majors due to the lower ratio of women to men. The logistic regression showed that the share of women was significant in predicting a STEM major, which points to an imbalance between men and women in STEM fields. A further step would be to look at women’s enrollment in these majors over time to see whether the current efforts to introduce women to STEM are effective.